1. EXECUTIVE SUMMARY


Approximate transforms that closely follow the discrete Fourier transform (DFT) have been studied and derived. These approximate-DFT (a-DFT) transforms are designed to give acceptable performance when forming multiple simultaneous spatial beams. The a-DFTs achieve almost the same performance as the DFT with a multiplier count of zero. The transforms therefore reduce the well-known O(N log N) multiplier complexity of fast Fourier transform (FFT) algorithms to zero for an N-point transform. A sparse factorization of each derived matrix has also been computed to reduce the adder complexity.

Approximate transforms with zero multiplier complexity have been found for the 8-, 16-, 32-, and 64-point cases. A frequency response analysis depicting the error performance is given for each case.

A 2.4 GHz, 16-element receive-mode multi-beamforming system has been implemented in the laboratory to verify the beams generated by the 8- and 16-point approximate transforms. Beam measurements are reported for both cases, and the beam patterns of the corresponding exact transforms have been measured for comparison. The measurements verify that the a-DFT beam patterns closely follow those of the respective exact transforms.
2. INTRODUCTION


FFTs are fast algorithms that compute the DFT with low computational complexity, and they are among the best-known algorithms in digital signal processing (DSP). The FFT owes its popularity to the fact that the transform it computes, the DFT, is of critical importance in a wide range of applications [1], such as wireless communications, data networks, sensor networks, cognitive radio, radar and beamforming, imaging, filtering, correlation and radio astronomy. FFTs efficiently compute an N-point DFT, where the DFT itself is an N x N linear transform that splits N samples of a signal into its constituent frequency components.
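As a point of reference, the N x N linear transform described above can be written out directly. The following is a minimal sketch in plain Python; the sign convention exp(-j 2 pi k n / N) and the helper name `dft` are assumptions made for illustration, not taken from the report.

```python
import cmath

def dft(x):
    """Direct N-point DFT: X[k] = sum_n x[n] * exp(-2j*pi*k*n/N).

    Every output needs N complex multiplications, so the whole transform
    costs on the order of N^2 operations.
    """
    n_points = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_points)
            for n in range(n_points))
        for k in range(n_points)
    ]

# A complex exponential at bin 2 of an 8-point transform lands in X[2] only.
N = 8
tone = [cmath.exp(2j * cmath.pi * 2 * n / N) for n in range(N)]
spectrum = dft(tone)
```

Each row of the transform acts as one filter of a filterbank, which is the view used later when the rows are interpreted as beams.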

The FFT is used in a very large number of applications realized on embedded systems based on fixed-point digital arithmetic (for example, the two's complement number system), and it is an important algorithm for communications, radar and sensor systems. In theory the FFT computes the DFT exactly. In practice the exact FFT is never realized, because digital architectures are subject to fixed-point effects such as quantization, rounding, saturation and truncation. A practical hardware FFT is therefore always a component of a lossy scheme. Multi-beamforming in today's systems is achieved by using FFT algorithms to perform the DFT operation. Computing an N-point DFT directly has a computational complexity of O(N^2); the FFT reduces this to O(N log2 N). Cooley-Tukey, Duhamel and Winograd are some of the popular FFT algorithms found in the literature [1-3]. Table 1 lists the arithmetic complexities of these algorithms for 8-point and 16-point transforms.


Table 1. Arithmetic Complexity Comparison

N-point DFT   Algorithm              Complex   Complex       Real     Real          Lower Bound
                                     Adders    Multipliers   Adders   Multipliers   (Heideman)
8-point       Radix-8                24        2             58       6             4
8-point       Cooley-Tukey Radix-2   24        5             73       15            4
8-point       Winograd (8-point)     26        2             62       6             4
16-point      Radix-2                64        17            213      51            20
16-point      Radix-4                64        10            178      30            20
16-point      Split-Radix            64        10            178      30            20
16-point      Winograd (16-point)    74        10            198      30            20
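The radix-2 rows of Table 1 can be related to a concrete algorithm. The sketch below is a textbook recursive radix-2 decimation-in-time FFT that also counts twiddle-factor multiplications. The counting convention (every twiddle with k != 0 costs one complex multiplication, including the trivial factor -j) is an assumption; under it the count comes out as 5 for N = 8 and 17 for N = 16, the complex-multiplier entries listed for radix-2. The names `fft_radix2` and `twiddle_count` are illustrative.

```python
import cmath

def fft_radix2(x, counter):
    """Recursive radix-2 decimation-in-time FFT (len(x) must be a power of two).

    counter[0] accumulates twiddle multiplications with k != 0
    (assumed counting convention; W^0 = 1 is treated as free).
    """
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2], counter)
    odd = fft_radix2(x[1::2], counter)
    out = [0j] * n
    for k in range(n // 2):
        if k == 0:
            t = odd[k]
        else:
            t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
            counter[0] += 1
        out[k] = even[k] + t            # butterfly: one addition
        out[k + n // 2] = even[k] - t   # butterfly: one subtraction
    return out

def twiddle_count(n):
    """Number of counted twiddle multiplications for an n-point transform."""
    counter = [0]
    fft_radix2([complex(i) for i in range(n)], counter)
    return counter[0]
```

Other counting conventions (for example, treating multiplication by -j as free) give smaller numbers, so the agreement with the table should be read as a consistency check rather than a definition.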

2.1  DFT-based  Multi-beams 

One of the most important applications of the FFT in military systems is the realization of radio frequency (RF) antenna array processing systems for electronically steerable multi-beam transmit and receive aperture arrays. Such multi-beam antennas are extremely important for RF sensing, communications, and radar systems, such as active electronically scanned array (AESA) radars and digital array radar (DAR). Applying an N-point FFT along the array samples, at each time frame, for a uniform linear array (ULA) of antenna elements (see Figure 1) yields N simultaneous RF beams. With available FFT algorithms, the computational complexity (complex-multiplier and complex-adder complexity) for an N-element ULA is O(N log2 N). For rectangular apertures of N x N antennas, applying an N-point 2-D FFT spatially across the aperture, at each time sample, yields N^2 independent RF beams (see Figure 2). In such a system, the 2-D DFT is computed on the rectangular/square array by computing N 1-D N-point DFTs along the rows and then another N 1-D N-point DFTs along the columns, with the outputs of the row-wise DFTs serving as the inputs to the column-wise transforms. In terms of N-point DFT cores, 2N cores are needed to compute a single N x N 2-D transform. The hardware complexity associated with achieving N^2 beams for an N x N rectangular aperture is O(N^2 log2 N^2).
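The row-column procedure described above can be sketched as follows. This is a minimal illustration in plain Python that uses a direct DFT as a stand-in for an N-point core; the function names are hypothetical.

```python
import cmath

def dft(x):
    """Direct N-point DFT, standing in for one N-point transform core."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]

def dft2_row_column(a):
    """2-D DFT of an N x N array a[y][x] using 2N one-dimensional DFTs."""
    rows = [dft(row) for row in a]                 # N row-wise DFTs
    cols = [dft(list(col)) for col in zip(*rows)]  # N column-wise DFTs on the row outputs
    return [list(r) for r in zip(*cols)]           # transpose so the result is out[ky][kx]

# A 2-D plane wave with spatial frequencies (ky, kx) = (1, 2) excites a single beam.
N = 4
wave = [[cmath.exp(2j * cmath.pi * (1 * y + 2 * x) / N) for x in range(N)]
        for y in range(N)]
beams = dft2_row_column(wave)
```

Replacing `dft` with any other N-point transform, exact or approximate, leaves the row-column structure unchanged, which is why a 1-D approximation carries over to the rectangular aperture.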


Figure 2: Operation of N-Point 2-D FFT for obtaining N^2 Simultaneous Beams
3. APPROXIMATE-DFT ALGORITHMS


The exact FFT is never realized, because digital architectures are subject to fixed-point effects such as quantization, rounding, saturation and truncation. As a result, practical hardware FFTs are always a component of a lossy scheme. Rather than aim for an exact DFT through an infinite-precision realization of an FFT, only to live with the residual errors that are unavoidable in practice, it makes sense to allow a tolerable deviation and find an approximation to the DFT whose filterbank responses deviate only slightly from the exact ones. By doing so, approximate DFTs can achieve circuit complexities and power consumption substantially lower than those of the best available FFT cores.

Approximate-DFT  (a-DFT)  algorithms  compute  the  DFT  at  significantly  lower  circuit  area, 
critical  path  latency,  and  power  consumption  for  a  particular  very-large-scale  integration  (VLSI) 
platform.  The  algorithms  are  trained  to  find  acceptably  small  deviations  in  the  DFT  filterbank 
responses  in  the  stopband  of  each  filter  to  achieve  a  reduction  in  complexity  and  power 
consumption. 

The implemented approximate-DFT matrices are multiplierless. Further, the adder complexity is reduced through matrix factorization. Lower bounds on arithmetic complexity are well defined for FFT algorithms, but because this approach approximates the exact DFT rather than computing it, those FFT lower bounds are no longer relevant. For example, computing y = 1.013 x1 - 0.999 x2 requires extensive multiplication hardware, while the approximation y ≈ x1 - x2 requires none. In applications that use a fixed-point FFT and can tolerate a baseline error level, the FFT can be replaced with these a-DFTs.
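The two-term example above can be made concrete. The sketch below compares the exact and rounded expressions and states the resulting error bound; the function names are illustrative only.

```python
def exact(x1, x2):
    # Needs two general-purpose multipliers in hardware.
    return 1.013 * x1 - 0.999 * x2

def approximate(x1, x2):
    # Coefficients rounded to +1 and -1: a single subtractor suffices.
    return x1 - x2

def error_bound(x1, x2):
    # exact - approximate = 0.013*x1 + 0.001*x2,
    # so |error| <= 0.013*|x1| + 0.001*|x2|.
    return 0.013 * abs(x1) + 0.001 * abs(x2)

samples = [(1.0, 1.0), (-3.5, 2.25), (100.0, -40.0)]
errors = [abs(exact(a, b) - approximate(a, b)) for a, b in samples]
```

The error scales with the input magnitude, so whether it is tolerable depends on the baseline error already present in the fixed-point system.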

Replacing the FFT with an a-DFT can bring the multiplier complexity of an N-element, N-beam beamformer down from O(N log2 N) to zero and, for an N x N rectangular aperture, from O(N^2 log2 N^2) to zero. With the corresponding sparse factorization, this reduction is achieved without increasing the adder complexity. If a fully parallel implementation of a digital multiplier is k times larger than a parallel adder circuit, then it can be shown that, with this approach, the fractional saving in VLSI real estate due to adopting a multiplierless FFT approximation is asymptotic to k/(1 + k) for large N. In brief, the approach works by accommodating a small and bounded (tolerable) computational error, which in turn leads to low-complexity multi-beam aperture arrays.
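The k/(1 + k) figure can be sanity-checked with a simple area model. The sketch below assumes (this is not stated in the report) that circuit area is proportional to a weighted count of arithmetic units, that the adder count is unchanged by the approximation, and that the multiplier-to-adder count ratio tends to one for large N. `area_saving` is a hypothetical helper.

```python
def area_saving(num_multipliers, num_adders, k):
    """Fraction of area saved when every multiplier is removed.

    Assumed model: one multiplier occupies k adder-areas and the adder
    count is the same before and after the approximation.
    """
    before = k * num_multipliers + num_adders
    after = num_adders
    return (before - after) / before

# With equal multiplier and adder counts the saving is exactly k / (1 + k).
saving_k4 = area_saving(100, 100, 4)   # 4 / 5
```

For a general multiplier-to-adder ratio r the same model gives r k / (1 + r k), which reduces to k/(1 + k) when r = 1.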

The following subsections introduce the approximate transforms found for the 8-, 16-, 32-, and 64-point cases.
3.1 8-point Approximate DFT Algorithm

The  matrix  form  of  the  8-point  DFT  approximation  found  is  given  in  (1)  [4,  5], 
It can be seen that the elements of \hat{F}_8 consist only of 0, ±1 and ±2, so the transform can be realized using only adders and bit-shift operations, that is, with zero multipliers.
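To illustrate why coefficients restricted to 0, ±1 and ±2 need no multipliers, the following sketch evaluates one output (one matrix row times the input vector) on integer samples using only addition, subtraction and a one-bit left shift. The function name and the sample row are hypothetical and are not a row of the report's matrix.

```python
def multiplierless_dot(coeffs, samples):
    """Dot product for integer samples and coefficients in {0, +1, -1, +2, -2}.

    Uses only addition, subtraction, and a one-bit left shift.
    """
    acc = 0
    for c, x in zip(coeffs, samples):
        if c not in (0, 1, -1, 2, -2):
            raise ValueError("coefficient outside {0, +-1, +-2}")
        if c == 0:
            continue                            # zero coefficient: no hardware at all
        term = (x << 1) if abs(c) == 2 else x   # times 2 is a wiring shift
        acc = acc + term if c > 0 else acc - term
    return acc

row = [1, -2, 0, 2, -1, 1, 0, -2]        # illustrative coefficients only
data = [3, -7, 11, 4, -5, 8, 2, -1]
result = multiplierless_dot(row, data)
```

In hardware the left shift is free (it is a rewiring of bits), so the cost of a row is set entirely by the number of non-zero coefficients.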

\hat{F}_8 can be factorized to further reduce the adder complexity:


A 

Fj  =  Px  diag{l2>  Ai,  Ag)  xDjx  diag( la-  A4)  x  Di  x  diag( E4.  A2)  x  Bg. 


where I_n is the identity matrix of order n and ⊗ denotes the Kronecker product.
D_1 = diag(1, 1, 1, 1, 1, 1/2, 1, 1/2), D_2 = diag(1, 1, 1, j, 1, j, j, 1).
P = [e_1 | e_5 | e_3 | e_6 | e_2 | e_8 | e_4 | e_7]^T


where e_i is the 8-point column vector having a 1 at the ith position and 0 elsewhere.
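The diagonal factors contain only the entries 1, 1/2 and j, none of which requires a multiplier. The sketch below shows this for integer complex samples stored as (real, imaginary) pairs. The representation, the helper names, and the use of an arithmetic right shift for the factor 1/2 are assumptions made for illustration; the diagonal entries are those listed in the text.

```python
def times_j(z):
    """Multiply (re, im) by j: j*(re + j*im) = -im + j*re (a swap and a negation)."""
    re, im = z
    return (-im, re)

def halve(z):
    """Multiply (re, im) by 1/2 with an arithmetic right shift (rounds toward -infinity)."""
    re, im = z
    return (re >> 1, im >> 1)

SCALERS = {"1": lambda z: z, "j": times_j, "1/2": halve}

# Diagonal entries as listed in the text for the 8-point factorization.
D1 = ["1", "1", "1", "1", "1", "1/2", "1", "1/2"]
D2 = ["1", "1", "1", "j", "1", "j", "j", "1"]

def apply_diagonal(diagonal, vector):
    """Apply a diagonal matrix, given as a list of symbols, to a list of (re, im) pairs."""
    return [SCALERS[d](z) for d, z in zip(diagonal, vector)]

v = [(2 * i, -4 * i) for i in range(8)]
after_d2 = apply_diagonal(D2, v)
after_d1 = apply_diagonal(D1, v)
```

A fixed-point design would also have to decide how the bit dropped by the right shift is rounded; that choice is not modeled here.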

Adders  Only  Signal  Flow  Graph  for  8-point  a-DFT 

Because the coefficients of \hat{F}_8 are small integers, Eq. (1) can be implemented such that the system contains only adders. The adders-only signal flow graph is shown in Figure 3.
Frequency Responses and Errors

Closed-form beam patterns obtained using the 8-point a-DFT algorithm and the FFT are shown in Figure 4. Figure 4 (a) shows the exact DFT beams, Figure 4 (b) shows the beams obtained using the 8-point a-DFT, and Figure 4 (c) shows the error between the two transforms. The 1-D closed-form beam patterns obtained using the 8-point a-DFT algorithm and the FFT for a Nyquist-spaced ULA are shown in Figure 5.
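The kind of comparison shown in these figures can be reproduced numerically for any candidate matrix. In the sketch below the approximate matrix is a stand-in obtained by rounding the real and imaginary parts of each exact DFT entry to the nearest multiple of 1/2. It is NOT the report's \hat{F}_8 and is used only so that the example runs end to end. All function names are hypothetical.

```python
import cmath

def dft_matrix(n):
    """Exact n-point DFT matrix with entries exp(-2j*pi*k*m/n)."""
    return [[cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n)]
            for k in range(n)]

def round_to_halves(matrix):
    """Stand-in approximation (not the report's matrix): round real and
    imaginary parts of every entry to the nearest multiple of 1/2."""
    return [[complex(round(2 * v.real) / 2, round(2 * v.imag) / 2) for v in row]
            for row in matrix]

def beam_response(row, omega):
    """Magnitude response of one transform row to a spatial tone exp(j*omega*m)."""
    return abs(sum(c * cmath.exp(1j * omega * m) for m, c in enumerate(row)))

N = 8
exact = dft_matrix(N)
approx = round_to_halves(exact)
grid = [-cmath.pi + 2 * cmath.pi * i / 512 for i in range(513)]

# Largest difference between exact and approximate beam magnitudes over all bins.
worst_error = max(
    abs(beam_response(exact[k], w) - beam_response(approx[k], w))
    for k in range(N) for w in grid
)
```

Plotting `beam_response` over `grid` for each row, in dB, gives curves of the same type as panels (a) and (b); the difference of the two gives panel (c).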


Figure 4: Closed-form Beam Patterns obtained using the 8-point a-DFT Algorithm and FFT
(a) Exact DFT beams, (b) beams obtained using the 8-point a-DFT, and (c) error between the two transforms.
Figure 5: 1-D Closed-form Beam Patterns obtained using the 8-point a-DFT Algorithm and FFT for a Nyquist spaced ULA
(a) Exact DFT beams, (b) beams obtained using the 8-point a-DFT, and (c) error between the two transforms.


Comparison  of  8-point  a-DFT  with  Reduced  Precision  FFT 

It is logical to compare the a-DFT against the exact FFT algorithm implemented with its twiddle factors W_N^k coarsely quantized (that is, with the precision of the coefficients reduced). Here, W_N^k = e^{-j 2 \pi k / N}, where 0 ≤ k ≤ N - 1 and N is the size of the DFT. The signal flow graph of the 8-point radix-2 FFT algorithm is shown in Figure 6, in which the places where the twiddle factors are involved are highlighted in red. When the precision of these coefficients is reduced, the hardware complexity decreases at the cost of reduced accuracy in the output frequency response. Comparing the performance of the filter bins shows the role of the a-DFT, which is multiplier free (its coefficients require only bit shifts): its performance is much better than that of a lower-precision implementation of the FFT. The plots compare the frequency responses of the exact FFT and the proposed a-DFT algorithm for each bin of the transform. For comparison purposes, a reduced precision of 4 bits has been used for the exact-DFT twiddle factors in each frequency bin.
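Reducing the twiddle-factor precision can be sketched as rounding each coefficient to a coarse fixed-point grid. The format used below (one sign bit plus bits - 1 fractional bits, with no saturation modeled) and the sign convention of the twiddle factor are assumptions; the report does not spell out its quantizer.

```python
import cmath

def quantize(value, bits):
    """Round to a signed fixed-point grid with (bits - 1) fractional bits.

    Assumed format; saturation at +1 is not modeled.
    """
    scale = 1 << (bits - 1)
    return round(value * scale) / scale

def quantized_twiddle(k, n, bits):
    """Twiddle factor exp(-2j*pi*k/n) with real and imaginary parts quantized."""
    w = cmath.exp(-2j * cmath.pi * k / n)
    return complex(quantize(w.real, bits), quantize(w.imag, bits))

# With 4 bits, cos(pi/4) = 0.7071... becomes 6/8 = 0.75.
w_4bit = quantized_twiddle(1, 8, 4)
w_error = abs(w_4bit - cmath.exp(-2j * cmath.pi / 8))
```

A quantized twiddle such as 0.75 still needs shift-and-add logic (0.75 = 1/2 + 1/4), whereas a multiplierless design avoids twiddle multiplications altogether.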
Figures 7 to 14 (horizontal axis: Normalized Frequency):

Figure 7: Output Comparison for Bin 1
Figure 8: Output Comparison for Bin 2
Figure 9: Output Comparison for Bin 3
Figure 10: Output Comparison for Bin 4
Figure 11: Output Comparison for Bin 5
Figure 12: Output Comparison for Bin 6
Figure 13: Output Comparison for Bin 7
Figure 14: Output Comparison for Bin 8
3.2 16-point Approximate DFT Algorithm


\hat{F}_{16} denotes the 16-point a-DFT matrix. For ease of illustration, the matrix is divided into four quadrants as shown in Eq. (2).
\hat{F}_{16} can be factorized to further reduce the adder complexity. The factorization comprises six stages and is given by


\hat{F}_{16} = W_5 W_4 W_3 W_2 D_1 W_1.


Matrices  pertaining  to  the  factorization  stages  are  shown  below. 
The matrix factors W_1, W_2, W_3, W_4 and W_5 are sparse matrices whose non-zero elements are -2, -1, 1 and 2 only, and
D_1 = (1/2) diag(1, 1, 1, 1, 1, 1, 1, 1, 1, j, j, j, j, j, j, j).
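Why a product of sparse factors lowers the adder count can be seen on a toy case. The sketch below uses the 4-point Walsh-Hadamard matrix, which is not one of the report's matrices, and counts additions as (non-zero entries - 1) per row, an assumed cost model that ignores sign changes and shifts.

```python
def adder_count(matrix):
    """Additions needed to apply a matrix row by row: (nonzeros - 1) per row."""
    return sum(max(sum(1 for v in row if v != 0) - 1, 0) for row in matrix)

def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

# Toy example (not from the report): two sparse butterfly stages.
STAGE_1 = [[1, 0, 1, 0],
           [0, 1, 0, 1],
           [1, 0, -1, 0],
           [0, 1, 0, -1]]
STAGE_2 = [[1, 1, 0, 0],
           [1, -1, 0, 0],
           [0, 0, 1, 1],
           [0, 0, 1, -1]]

DENSE = matmul(STAGE_2, STAGE_1)   # the same transform as a single dense matrix

dense_adders = adder_count(DENSE)                               # 12
factored_adders = adder_count(STAGE_1) + adder_count(STAGE_2)   # 8
```

Applying the stages one after another, rightmost factor first, computes the same output as the dense matrix with fewer additions; the multi-stage factorizations in this section exploit the same effect at larger sizes.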
Adders Only Signal Flow Graph for 16-point a-DFT

The factorization of \hat{F}_{16} can be used to develop the signal flow graph of the 16-point a-DFT, which is shown in Figure 15.

Figure 15: Signal Flow Graph of 16-point a-DFT
Frequency Responses and Errors

The 1-D closed-form beam patterns obtained using the 16-point a-DFT algorithm and the FFT for a Nyquist-spaced ULA are shown in Figure 16.

Figure 16: 1-D Closed-form Beam Patterns obtained using the 16-point a-DFT Algorithm and FFT for a Nyquist spaced ULA
(a) Exact DFT beams, (b) beams obtained using the proposed 16-point a-DFT, and (c) error between the two transforms.


Closed-form beam patterns obtained using the 16-point a-DFT algorithm and the DFT for a Nyquist-spaced ULA are shown in Figure 17.


Figure 17: Closed-form Beam Patterns obtained using the 16-point a-DFT Algorithm and DFT for a Nyquist spaced ULA
(a) Exact DFT beams, (b) beams obtained using the proposed 16-point a-DFT, and (c) error between the two transforms.

Comparison  of  16-point  a-DFT  with  Reduced  Precision  FFT 

The comparison between the exact FFT bins, the reduced-precision implementation of the DFT coefficients, and the a-DFT is repeated for the 16-point case. The following plots compare the frequency responses of the exact FFT and the proposed a-DFT algorithm for each bin of the transform. For comparison, a reduced precision of 4 bits has been used for the exact-DFT twiddle factors in each frequency bin.
Figures 18 to 31 (horizontal axis: Normalized Frequency):

Figure 18: Output Comparison for Bin 1
Figure 19: Output Comparison for Bin 2
Figure 20: Output Comparison for Bin 3
Figure 21: Output Comparison for Bin 4
Figure 22: Output Comparison for Bin 5
Figure 23: Output Comparison for Bin 6
Figure 24: Output Comparison for Bin 7
Figure 25: Output Comparison for Bin 8
Figure 26: Output Comparison for Bin 9
Figure 27: Output Comparison for Bin 10
Figure 28: Output Comparison for Bin 11
Figure 29: Output Comparison for Bin 12
Figure 30: Output Comparison for Bin 13
Figure 31: Output Comparison for Bin 14

Approved  for  public  release;  distribution  is  unlimited. 


Figures 32 and 33 (axes: Normalized Frequency, Magnitude [dB]):

Figure 32: Output Comparison for Bin 15
Figure 33: Output Comparison for Bin 16
3.3 32-point Approximate DFT Algorithm

Equation (3) shows the 32-point a-DFT matrix \hat{F}_{32}. For convenience, \hat{F}_{32} is divided into four 16 x 16 matrices.
\hat{F}_{32} can be factorized to reduce the adder complexity. The derived factorization consists of eight stages and is given by


\hat{F}_{32} = W_8 W_7 W_6 W_5 W_4 W_3 W_2 W_1,


where the matrices W_k (k = 1, ..., 8) are shown below.
Frequency Responses and Errors

Closed-form  beam  patterns  obtained  using  the  32-point  a-DFT  algorithm  and  FFT  are  shown  in 
Figure  34.  Figure  34  (a)  shows  the  exact  DFT  beams  and  Figure  34  (b)  shows  the  beams 
obtained  using  the  proposed  32-point  a-DFT.  Figure  34  (c)  shows  the  error  between  the  two 
transforms. 


Figure 34: Closed-form Beam Patterns obtained using the 32-point a-DFT Algorithm and FFT
(a) Exact DFT beams, (b) beams obtained using the proposed 32-point a-DFT, and (c) error between the two transforms.


The  1-D  closed  form  beam  patterns  obtained  using  the  32-point  a-DFT  algorithm  and  FFT  for  a 
Nyquist  spaced  ULA  are  shown  in  Figure  35. 


Figure 35: 1-D Closed-form Beam Patterns obtained using the 32-point a-DFT Algorithm and FFT for a Nyquist spaced ULA
(a) Exact DFT beams, (b) beams obtained using the proposed 32-point a-DFT, and (c) error between the two transforms.

Comparison  of  32-point  a-DFT  with  Reduced  Precision  FFT 

The comparison between the exact FFT bins, the reduced-precision implementation of the DFT coefficients, and the a-DFT is repeated for the 32-point case. The following plots compare the frequency responses of the exact FFT and the proposed a-DFT algorithm for each bin of the transform. For comparison, a reduced precision of 4 bits has been used for the exact-DFT twiddle factors in each frequency bin.


Figures 36 to 67 (horizontal axis: Normalized Frequency; legend: 32-bit FFT, 4-bit FFT, approx-DFT):

Figure 36: Output Comparison for Bin 1
Figure 37: Output Comparison for Bin 2
Figure 38: Output Comparison for Bin 3
Figure 39: Output Comparison for Bin 4
Figure 40: Output Comparison for Bin 5
Figure 41: Output Comparison for Bin 6
Figure 42: Output Comparison for Bin 7
Figure 43: Output Comparison for Bin 8
Figure 44: Output Comparison for Bin 9
Figure 45: Output Comparison for Bin 10
Figure 46: Output Comparison for Bin 11
Figure 47: Output Comparison for Bin 12
Figure 48: Output Comparison for Bin 13
Figure 49: Output Comparison for Bin 14
Figure 50: Output Comparison for Bin 15
Figure 51: Output Comparison for Bin 16
Figure 52: Output Comparison for Bin 17
Figure 53: Output Comparison for Bin 18
Figure 54: Output Comparison for Bin 19
Figure 55: Output Comparison for Bin 20
Figure 56: Output Comparison for Bin 21
Figure 57: Output Comparison for Bin 22
Figure 58: Output Comparison for Bin 23
Figure 59: Output Comparison for Bin 24
Figure 60: Output Comparison for Bin 25
Figure 61: Output Comparison for Bin 26
Figure 62: Output Comparison for Bin 27
Figure 63: Output Comparison for Bin 28
Figure 64: Output Comparison for Bin 29
Figure 65: Output Comparison for Bin 30
Figure 66: Output Comparison for Bin 31
Figure 67: Output Comparison for Bin 32
3.4 64-point Approximate DFT Algorithm


Different  candidate  matrices  for  the  64-point  approximate  DFT  were  obtained  based  on  the 
realization  complexity  and  the  closeness  to  the  exact  DFT  (governed  by  a  defined  threshold 
parameter  a).  These  matrices  were  analyzed  to  find  the  matrix  that  yields  the  lowest  hardware 
realization  complexity  for  which  the  performance  is  acceptable  for  beamforming  applications. 


The frequency responses of the filter bank beams for a = 1 and a = 2 are shown in Figure 68. The plots show that the matrix arising from a = 1, which has the lowest complexity, gives acceptable beamforming performance. The 64-point a-DFT matrix for a = 1, \hat{F}_{64}, is given below. For convenience of representation, the following notation is used to denote the matrix:
\hat{F}_{64} can be factorized to reduce the adder complexity. \hat{F}_{64} has been factorized into 12 stages, which can be denoted as

\hat{F}_{64} = W_{12} W_{11} W_{10} W_9 W_8 W_7 W_6 W_5 W_4 W_3 W_2 W_1,

where W_i (i = 1, 2, ..., 12) denotes a sparse matrix. The matrices W_i are shown below.

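The following notation is used to denote the factorized matrices:

W_i = \begin{bmatrix} R_i^{00} & R_i^{01} \\ R_i^{10} & R_i^{11} \end{bmatrix} + j \begin{bmatrix} I_i^{00} & I_i^{01} \\ I_i^{10} & I_i^{11} \end{bmatrix},

where i = 1, 2, ..., 12.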


Note  that  Wi,  W3,  Ws,  Wi,  W%  Wn  and  W12  are  real  matrices. 
Frequency Responses and Errors


Figure  68  (a)  and  (b)  show  the  Matlab  simulated  beams  using  the  obtained  approximate  matrices 
for  a  =  1  and  2.  Figure  68  (c)  shows  the  beams  corresponding  to  the  exact  DFT. 


Figure  68:  Simulated  Beams  (a)  a  =  1,  (b)  a  =  2,  and  (c)  exact  64-point  DFT 


Figure 69 (a) shows the simulated 64-point a-DFT beams and Figure 69 (b) shows the corresponding beams obtained using the exact DFT. Figure 69 (c) depicts the error between the magnitude responses.


Figure 69: (a) Simulated a-DFT Beams, (b) Corresponding Exact DFT Beams, and (c) Error between Magnitude Responses

The  beam  outputs  pertaining  to  all  the  bins  of  64-point  a-DFT  are  shown  below. 
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Figure  70:  Output  Comparison  for  Bin  1 


Figure  71:  Output  Comparison  for  Bin  2 
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Figure  72:  Output  Comparison  for  Bin  3 


Figure  73:  Output  Comparison  for  Bin  4 
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Figure  74:  Output  Comparison  for  Bin  5 
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Figure  75:  Output  Comparison  for  Bin  6 
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Figure  76:  Output  Comparison  for  Bin  7 
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Figure  77:  Output  Comparison  for  Bin  8 
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Figure  78:  Output  Comparison  for  Bin  9 
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Figure  79:  Output  Comparison  for  Bin  10 
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Figure  80:  Output  Comparison  for  Bin  11 
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Figure  81:  Output  Comparison  for  Bin  12 
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Figure  82:  Output  Comparison  for  Bin  13 
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Figure  83:  Output  Comparison  for  Bin  14 
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Figure  84:  Output  Comparison  for  Bin  15 
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Figure  85:  Output  Comparison  for  Bin  16 
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Figure  86:  Output  Comparison  for  Bin  17 
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Figure  87:  Output  Comparison  for  Bin  18 
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Figure  88:  Output  Comparison  for  Bin  19 
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Figure  89:  Output  Comparison  for  Bin  20 
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Figure  90:  Output  Comparison  for  Bin  21 
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Figure  91:  Output  Comparison  for  Bin  22 
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Figure  92:  Output  Comparison  for  Bin  23 
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Figure  93:  Output  Comparison  for  Bin  24 
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Figure  94:  Output  Comparison  for  Bin  25 
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Figure  95:  Output  Comparison  for  Bin  26 
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Figure  96:  Output  Comparison  for  Bin  27 
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Figure  97:  Output  Comparison  for  Bin  28 
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Figure  98:  Output  Comparison  for  Bin  29 
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Figure  99:  Output  Comparison  for  Bin  30 
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Figure  100:  Output  Comparison  for  Bin  31 
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Figure  101:  Output  Comparison  for  Bin  32 
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Figure  102:  Output  Comparison  for  Bin  33 
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Figure  103:  Output  Comparison  for  Bin  34 
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Figure  104:  Output  Comparison  for  Bin  35 
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Figure  105:  Output  Comparison  for  Bin  36 
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Figure  106:  Output  Comparison  for  Bin  37 
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Figure  107:  Output  Comparison  for  Bin  38 
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Figure  108:  Output  Comparison  for  Bin  39 
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Figure  109:  Output  Comparison  for  Bin  40 
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Figure  110:  Output  Comparison  for  Bin  41 
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Figure  111:  Output  Comparison  for  Bin  42 
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Figure  112:  Output  Comparison  for  Bin  43 


Figure  113:  Output  Comparison  for  Bin  44 
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Figure  114:  Output  Comparison  for  Bin  45 


Figure  115:  Output  Comparison  for  Bin  46 
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Figure  116:  Output  Comparison  for  Bin  47 
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Figure  117:  Output  Comparison  for  Bin  48 
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Figure  118:  Output  Comparison  for  Bin  49 
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Figure  119:  Output  Comparison  for  Bin  50 
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Figure  120:  Output  Comparison  for  Bin  51 
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Figure  121:  Output  Comparison  for  Bin  52 
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Figure  122:  Output  Comparison  for  Bin  53 
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Figure  123:  Output  Comparison  for  Bin  54 


118 

Approved  for  public  release;  distribution  is  unlimited. 


[Plot axes: Magnitude (dB) versus Normalized Frequency]

Figure  124:  Output  Comparison  for  Bin  55 


Figure 125: Output Comparison for Bin 56
Figure 126: Output Comparison for Bin 57
Figure 127: Output Comparison for Bin 58
Figure 128: Output Comparison for Bin 59
Figure 129: Output Comparison for Bin 60
Figure 130: Output Comparison for Bin 61


[Plot axis: Normalized Frequency; legend: 32-bit FFT, 4-bit FFT, approx-DFT]

Figure 131: Output Comparison for Bin 62
Figure 132: Output Comparison for Bin 63
Figure 133: Output Comparison for Bin 64

3.5 VLSI Implementation and Comparison of the Hardware Complexities

All approximate transforms found were implemented in digital hardware targeting an FPGA.
Xilinx tools were used to implement the designs in a fully parallel-input, parallel-output
architecture. To compare performance metrics such as area, time, and power, the corresponding
exact transforms were also implemented. The designs were pipelined for maximum speed of
operation and were synthesized and mapped targeting the Xilinx Virtex-6 SX475T chip. Table 2
summarizes the hardware utilization and the critical path delay for the 8-, 16-, 32-, and
64-point transforms (both approximate and exact). The comparison was performed for 8- and
16-bit input word lengths. The twiddle factor word length for the exact FFT designs was fixed
at 8 bits for all transform sizes.


Table 2. Comparison of Hardware Resource Consumption using Xilinx Virtex-6 SX475T
for different Numbers of Points with different Input Precision

FFT       Word      T_CPD (ns)       Slice Registers    LUTs               Occupied Slices    Flip-flops
Design    Length    Exact   Appr.    Exact    Appr.     Exact     Appr.    Exact    Appr.     Exact     Appr.

8-point   8 bits    2.107   1.929    1,4H     1,288     1,456     1,012    459      345       1,633     1,231
          16 bits   2.125   2.009    1,705    2,280     1,905     1,690    572      582       2,055     2,131
16-point  8 bits    1.886   1.966    3,247    2,528     4,030     2,488    1,338    809       4,543     2,765
          16 bits   2.043   2.029    3,634    4,352     5,070     4,238    1,545    1,301     5,410     4,635
32-point  8 bits    3.420   2.212    4,074    6,420     7,193     5,866    2,265    1,888     7,287     6,465
          16 bits   3.611   2.085    6,698    10,788    11,920    9,837    3,507    2,702     12,019    10,252
64-point  8 bits    2.316   2.143    18,980   10,889    40,976    16,725   11,857   5,407     41,831    17,240
          16 bits   2.661   2.216    39,218   20,322    101,859   29,033   27,579   8,566     101,860   30,023
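
As a reading aid for Table 2, the relative savings of the approximate designs can be computed directly from the tabulated counts. The sketch below does this for the LUT column at 16-bit word length; the function name `percent_reduction` and the dictionary layout are illustrative choices, not part of the original design flow.

```python
def percent_reduction(exact, approx):
    """Saving of the approximate design as a percentage of the exact design."""
    return 100.0 * (exact - approx) / exact


# (exact, approximate) LUT counts copied from Table 2, 16-bit input word length
luts_16bit = {
    8: (1905, 1690),
    16: (5070, 4238),
    32: (11920, 9837),
    64: (101859, 29033),
}

for n_points, (exact, approx) in luts_16bit.items():
    saving = percent_reduction(exact, approx)
    print(f"{n_points}-point: {saving:.1f}% fewer LUTs than the exact FFT")
```

A negative result would indicate that the approximate design uses more of a resource than the exact one, which does occur for some slice-register entries in the table.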
3.6 ASIC Realization Metrics Comparison

The designs were also mapped to 45-nm complementary metal-oxide-semiconductor (CMOS)
technology cells (synthesis only) to evaluate the performance achievable in an ASIC
realization. Key quantitative measures of performance for the a-DFT realizations are listed
in Table 3.

Table 3. Quantitative Measures of Performance for Approximate DFT Realizations

                                 16-point                           32-point
Performance Metric               Radix-2 FFT   a-DFT     Change     Duhamel algorithm   a-DFT     Change

Area, A (mm²)                    0.295         0.166     44% ↓      0.856               0.465     46% ↓
Critical path delay, T (ns)      1.35          0.9       33% ↓      1.73                0.86
Frequency, F_max (GHz)           0.74          1.11                 0.58                1.16      100% ↑
AT (mm² ns)                      0.398         0.149                1.481               0.400     73% ↓
AT² (mm² ns²)                    0.537         0.134                2.562               0.344
Dynamic power, D_p (mW/GHz)      380.52        231.34    39% ↓      1303                580       56% ↓
Largest side-lobe level (dB)     -13.26        -13.05    0.21 ↑     ≈ -13.26            -11.03    2.23 ↑
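
The derived rows of Table 3 can be cross-checked from the area and delay rows. The sketch below assumes the usual definitions F_max = 1/T, AT = A x T, and AT² = A x T², with the 32-point FFT area read as 0.856 mm² and the 16-point FFT delay read as 1.35 ns; the names `designs` and `derived_metrics` are illustrative.

```python
# (area in mm^2, critical path delay in ns) as read from Table 3
designs = {
    "16-point radix-2 FFT": (0.295, 1.35),
    "16-point a-DFT": (0.166, 0.90),
    "32-point Duhamel FFT": (0.856, 1.73),
    "32-point a-DFT": (0.465, 0.86),
}


def derived_metrics(area_mm2, delay_ns):
    """Return (F_max in GHz, AT in mm^2*ns, AT^2 in mm^2*ns^2)."""
    f_max_ghz = 1.0 / delay_ns
    at = area_mm2 * delay_ns
    at2 = area_mm2 * delay_ns ** 2
    return f_max_ghz, at, at2


for name, (area, delay) in designs.items():
    f_max, at, at2 = derived_metrics(area, delay)
    print(f"{name}: F_max = {f_max:.2f} GHz, AT = {at:.3f}, AT^2 = {at2:.3f}")
```

Under these assumptions the recomputed values agree with the tabulated ones to within rounding of the last digit.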

3.7  Simulation  of  2-D  Beams  Cross  Sections 

The computational complexity associated with obtaining N² simultaneous beams using an N x N
rectangular aperture grows as O(N² log2 N²). In general, for a 2-D signal x(m, n), where
m = 0, 1, ..., M-1 and n = 0, 1, ..., N-1, the 2-D DFT is defined as

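$$X(k, l) = \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} x(m, n)\, e^{-j 2\pi k m / M}\, e^{-j 2\pi l n / N}$$

This can be rewritten as

$$X(k, l) = \sum_{m=0}^{M-1} \left[ \sum_{n=0}^{N-1} x(m, n)\, e^{-j 2\pi l n / N} \right] e^{-j 2\pi k m / M} = \sum_{m=0}^{M-1} G(m, l)\, e^{-j 2\pi k m / M}$$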

where G(m, l) is the 1-D DFT along n. Therefore, the 2-D transform of an aperture array can be
realized as a row-wise transformation of the column-wise transform. Replacing the FFT with the
a-DFT reduces the required hardware complexity further, since no multipliers are involved.
Thus, with a-DFT cores, the 2-D beams associated with the N x N aperture can be realized at
zero multiplier complexity. For comparison, a radix-2 realization of an 8 x 8 2-D FFT would
require 64 real multipliers, a 16 x 16 2-D FFT would require 768 real multipliers, and a
32 x 32 2-D FFT would require 5632 real multipliers.
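
The row-column decomposition and the quoted multiplier counts can be illustrated with a short sketch. It uses a direct O(N²) 1-D DFT purely for clarity, not an FFT or the a-DFT. The multiplier formula (3N/2) log2 N - 5N + 8 is an assumption introduced here: it is the commonly cited real-multiplication count for a radix-2 FFT with three-multiplier complex products, and it reproduces the three figures quoted above when applied to the 2N one-dimensional transforms of an N x N array. All function names are illustrative.

```python
import cmath
import math


def dft(x):
    """Direct 1-D DFT of a sequence (O(N^2), for illustration only)."""
    n = len(x)
    return [sum(x[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n))
            for k in range(n)]


def dft2_row_column(x):
    """2-D DFT computed as 1-D DFTs along n, then 1-D DFTs along m."""
    m_len, n_len = len(x), len(x[0])
    g = [dft(row) for row in x]  # G(m, l): 1-D DFT along n
    cols = [dft([g[m][l] for m in range(m_len)]) for l in range(n_len)]
    return [[cols[l][k] for l in range(n_len)] for k in range(m_len)]  # X(k, l)


def dft2_direct(x):
    """2-D DFT evaluated straight from the double-sum definition."""
    m_len, n_len = len(x), len(x[0])
    return [[sum(x[m][n] * cmath.exp(-2j * math.pi * (k * m / m_len + l * n / n_len))
                 for m in range(m_len) for n in range(n_len))
             for l in range(n_len)]
            for k in range(m_len)]


def radix2_real_multipliers(n):
    """Assumed count for an n-point radix-2 FFT, n a power of two."""
    log2_n = n.bit_length() - 1
    return (3 * n // 2) * log2_n - 5 * n + 8


def multipliers_2d(n):
    """An n x n 2-D transform uses 2n one-dimensional n-point transforms."""
    return 2 * n * radix2_real_multipliers(n)


for size in (8, 16, 32):
    print(f"{size} x {size}: {multipliers_2d(size)} real multipliers")
```

Replacing each 1-D transform with a multiplierless a-DFT core brings all of these counts to zero, at the cost of the approximation error discussed earlier.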

The 2-D beam plots obtained with the a-DFT transforms were analyzed across azimuth and
elevation angle cuts. The following subsections provide 2-D beam plots and their sliced beam
patterns over the azimuthal and elevation angles for 2-D beams arising from the 8 x 8,
16 x 16, and 32 x 32 a-DFT transforms. For the simulations, the angles are measured as shown
in Figure 134.


[Figure 134 labels: plane wave w(x, y, ct); antenna elements; 0 < φ < 180; -90 < ψ < 90]

Figure  134:  Symbol  Convention  for  the  Simulated  Plots 
3.7.1 8-point Approximation

Figure 135: (i) 2-D plots for φ = 90.0°, ψ = -30.0° and (ii) 2-D plots for φ = 71.6°, ψ = -52.0°

Notes for (i): (a) a-DFT, (b) exact DFT, (c) along φ for fixed ψ = -30.0°, and (d) along ψ for fixed φ = 90.0°.

Notes for (ii): (a) a-DFT, (b) exact DFT, (c) along φ for fixed ψ = -52.0°, and (d) along ψ for fixed φ = 71.6°.
3.7.2 16-point Approximation

Figure 136: (i) 2-D plots for φ = 44.8°, ψ = -62.0° and (ii) 2-D plots for φ = 16.0°, ψ = -65.5°

Notes for (i): (a) a-DFT, (b) exact DFT, (c) along φ for fixed ψ = -62.0°, and (d) along ψ for fixed φ = 44.8°.

Notes for (ii): (a) a-DFT, (b) exact DFT, (c) along φ for fixed ψ = -65.5°, and (d) along ψ for fixed φ = 16.0°.
3.7.3 32-point Approximation

Figure 137: (i) 2-D plots for φ = 23.2°, ψ = -28.5° and (ii) 2-D plots for φ = 26.4°, ψ = -78.0°

Notes for (i): (a) a-DFT, (b) exact DFT, (c) along φ for fixed ψ = -28.5°, and (d) along ψ for fixed φ = 23.2°.

Notes for (ii): (a) a-DFT, (b) exact DFT, (c) along φ for fixed ψ = -70.0°, and (d) along ψ for fixed φ = 26.4°.
4. HARDWARE SETUP


To physically obtain and measure the beams produced by the proposed approximate DFT
algorithms (and compare them with the DFT beams), a 2.4-GHz RF system with a digital
processing back end was designed and implemented. Figure 138 shows the overall architecture of
the implemented beam measurement system. The initial system was designed as a 16-element
antenna system. The main subsystems are the antenna array, the RF receiver chain, and the
digital processing unit. Each antenna element is associated with an in-phase (I) and
quadrature (Q) receiver chain in which amplification, mixing, and filtering are performed. The
downconverted baseband signal is then digitized by analog-to-digital conversion (ADC) and
processed in digital hardware, where the DFT-based multi-beamforming is performed. Each beam
output is then further processed to integrate and estimate the received beam energy for a
specific antenna orientation.

Figure 138: The System Architecture

4.1 2.4-GHz Antenna Array


A 16-element antenna array was designed to operate at 2.4 GHz. A single-element patch was
simulated and fabricated, as shown in Figure 139 (c). Figure 139 (a) and (b) show the
simulated and measured |S11|, respectively. Next, a 16-element array was constructed with the
element spacing set to λ/2 ≈ 62.5 mm. Figure 139 (d) shows the full 16-element antenna array.
Each element was verified to work using a transmitted 2.4-GHz signal. Figure 139 (e) shows the
measured power pattern of the array elements. Measurements were obtained using the setup
described in Section 4.4.


Figure 139: Simulated and Fabricated Single-Element Patch for the 16-Element Antenna Array

(a) Simulated S11, (b) measured S11 for a single patch antenna, (c) fabricated
2.4-GHz patch antenna element with the integrated low noise amplifier, (d) full 16-element
antenna array, and (e) measured antenna patterns.

4.2  RF  Receiver  Chain 

The RF receiver chain was designed as illustrated in Figure 138. Commercial off-the-shelf
(COTS) components were used to build the receivers. First, the captured signal is amplified
using a low noise amplifier (LNA). It is then bandpass filtered to select the 2.4-GHz band.
The amplified and filtered signal is split in two for in-phase/quadrature (IQ)
downconversion. A local oscillator (LO) signal is split into two copies 90° apart in phase and
fed to the two mixers to obtain the IQ downconverted signals. Each mixer output is then
low-pass filtered and amplified again to boost the filtered baseband signal. Figure 140 (a)
shows the full 16-element receiver chains built from the COTS components.
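
The mixing and low-pass filtering steps of the receiver chain can be illustrated numerically. The sketch below is an idealized model, not the implemented hardware: it assumes a unit-amplitude 2.4-GHz tone, an LO at 2.410 GHz (the value used for the measurements in Section 4.5), ideal multipliers as mixers, and a moving average as the low-pass filter. The simulation rate `FS_SIM`, the filter length, and all function names are hypothetical choices made for this example.

```python
import cmath
import math


def iq_downconvert(f_rf, f_lo, fs, n_samples, avg_len):
    """Mix a real RF tone with a quadrature LO, then low-pass filter.

    Returns complex samples I + jQ after a moving-average filter.
    """
    mixed = []
    for k in range(n_samples):
        t = k / fs
        rf = math.cos(2 * math.pi * f_rf * t)
        # I mixer multiplies by cos(LO), Q mixer by -sin(LO): the 90-degree split
        mixed.append(rf * cmath.exp(-2j * math.pi * f_lo * t))
    # The moving average suppresses the sum-frequency (f_rf + f_lo) product
    return [sum(mixed[k:k + avg_len]) / avg_len
            for k in range(n_samples - avg_len + 1)]


def estimate_tone_hz(samples, fs):
    """Estimate a tone's frequency from the mean phase step between samples."""
    acc = sum(b * a.conjugate() for a, b in zip(samples, samples[1:]))
    return cmath.phase(acc) * fs / (2 * math.pi)


FS_SIM = 20e9  # hypothetical simulation rate, well above the 4.81-GHz sum product
baseband = iq_downconvert(2.400e9, 2.410e9, FS_SIM, 4000, 25)
f_if = estimate_tone_hz(baseband, FS_SIM)
print(f"Downconverted tone magnitude: {abs(f_if) / 1e6:.2f} MHz")
```

The sign of the estimated frequency is negative in this model because the LO sits above the RF tone; only the 10-MHz magnitude matters for the energy measurement.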
4.3 Digital Hardware and Design Architectures

The Collaboration for Astronomy Signal Processing and Electronics Research (CASPER)
Reconfigurable Open Architecture Computing Hardware (ROACH) platform was used in our
system to sample the intermediate frequency (IF) signal and perform the digital beamforming.
ROACH-2 [6] is an open source platform with the following notable features:

•  Virtex-6  SX475T  field-programmable  gate  array  (FPGA), 

•  PowerPC  440EPx  stand-alone  processor  to  provide  control  functions, 

•  2x multi-gigabit transceiver card slots (4x10GE),

•  2  ZDOK  interfaces. 

The two ZDOK interfaces can be used to attach daughter ADC cards manufactured by CASPER to
digitize the signals. The current setup employs two ADC16x250-8 cards [7], each of which
accepts up to 16 analog inputs. Together, the two cards provide 32 analog inputs, enough to
sample the 32 channels (I and Q) arising from the 16 antenna elements. The cards can be
configured for different sampling rates: 32 inputs at up to 240 MHz, 16 inputs at up to
480 MHz, or 8 inputs at up to 960 MHz. A picture of the ADC cards installed on the ROACH-2
platform is shown in Figure 140 (b). The ROACH-2 platform comes with a high-end Virtex-6
SX475T FPGA, which can accommodate large designs. The device has 476,160 logic cells, 74,400
configurable logic blocks (CLBs), and 2016 DSP48 slices.


Figure  140:  (a)  Receiver  Chains  Implemented  using  COTS  Components  and  (b)  ROACH-2 
Processing  Platform  with  the  Two  ADC  Cards  Connected  to  the  FPGA  Board 


FPGA  Designs  for  Multi-Beamforming 

The digital designs for multi-beamforming using the proposed algorithm and the exact DFT (for
comparison purposes) were created using Xilinx tools. FPGA designs for both approximate and
exact DFTs were produced for 8-, 16-, 32-, and 64-point transforms. A fully parallel-input,
parallel-output architecture was adopted to achieve maximum speed of operation. The designs
were tested and verified using hardware cosimulation. The hardware resource consumption and
the timing information were recorded for both approximate and exact DFTs by synthesizing and
mapping them to the Xilinx Virtex-6 SX475T chip. Table 2 summarizes the key figures of merit.

For the measurements, all transform cores were configured for an 8-bit input word length,
because the ADC16x250-8 cards used in the setup have an 8-bit output. The cores were
pipelined so that they could run at a clock frequency of 200 MHz.

Digital  Circuit  Architectures  for  Beam  Measurements 

Apart from the digital cores that perform the spatial DFT of the sampled signals, the
additional circuitry needed to support beam measurement was also implemented inside the FPGA,
which makes it convenient to manipulate the data in real time and for different angles.
Figure 141 shows an overview of the digital circuit architecture.

The front portion of the digital architecture consists of the digital normalizing circuit,
which is used to calibrate the RF chains. The calibration procedure is described in
Section 4.4. This stage consists of a set of multipliers (an N-element design needs 2N
multipliers in this stage), where one input of each multiplier is connected to a 32-bit
software controllable register (SCR) and the other input is the ADC channel. The SCRs are all
first set to 1 to determine the calibration gain of each RF chain for a reference input. Once
the calibration gains are determined for each channel, each SCR is overwritten with the
corresponding gain value.
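
The text does not state how the calibration gains are derived from the reference captures. One plausible approach, sketched below as an assumption, is to equalize the RMS level of every channel to a common target. This corrects amplitude mismatch only; the function names and the choice of the strongest channel as the default target are illustrative.

```python
import math


def rms(samples):
    """Root-mean-square level of a list of real samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def calibration_gains(captures, target_rms=None):
    """Per-channel gains that equalize the RMS response to a reference input.

    captures: one list of samples per channel, recorded with every gain
    register set to 1.
    """
    levels = [rms(channel) for channel in captures]
    if target_rms is None:
        target_rms = max(levels)  # assumption: scale up to the strongest channel
    return [target_rms / level for level in levels]


# Synthetic example: the same reference tone seen through mismatched chains
reference = [math.sin(2 * math.pi * k / 20) for k in range(200)]
captures = [[a * s for s in reference] for a in (1.0, 0.5, 2.0)]
gains = calibration_gains(captures)
normalized = [[g * s for s in ch] for g, ch in zip(gains, captures)]
print([round(rms(ch), 6) for ch in normalized])
```

In a fixed-point implementation the target level would also have to respect the register and data-path word lengths, which this sketch ignores.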


Figure  141:  Digital  Architecture  for  obtaining  a-DFT/DFT  Beam 


After the digital normalizing stage, the signals are driven to the a-DFT/DFT digital core. One
signal of each pair (either I or Q) is fed to the real inputs of the core and the other is fed
to the imaginary inputs. Next, the real and imaginary outputs of each output bin of the core
are used to calculate the instantaneous power of the sample. This is achieved by computing
(Re{Y_k})² + (Im{Y_k})², where 0 ≤ k ≤ N - 1, and is implemented with two multipliers and one
adder per channel. The word length of the input to this block depends on the bit growth in the
a-DFT/FFT core. The output from this block is sent to an accumulator that integrates over a
pre-specified time period. The integration time can be modified through software control.

The overall architecture is designed to act as a lock-in amplifier that filters out ambient
2.4-GHz radiation present in the environment. To achieve this, the transmitted signal is
switched on and off at a particular rate, and the energy levels received while the transmitter
is on and while it is off are calculated separately. The transmitter design approach is
discussed at the end of Section 4.4. The digital circuit for the receiver has been designed to
work with this setup. For this purpose, two integrators are used for each channel, and these
integrators are activated depending on whether the transmitted signal is on or off. An energy
detector at the front of the circuit provides this functionality. The Boolean output from this
block is used as the select signal of a demultiplexer (demux) that selects one of the
integrators when the RF is on and the other when the RF is off. Finally, the difference of the
two integrator values is computed as the received energy of a particular bin. The computed
values are updated in FPGA memory and are read by the host server using the software routines.
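
The per-bin power computation and the two-integrator scheme can be summarized in software form. The sketch below is a behavioral model only: it assumes the on/off decision is already available as a Boolean per sample (the role of the energy detector) and that the on and off periods contain equal numbers of samples, so that the ambient contribution cancels in the difference. The function name is illustrative.

```python
def lock_in_energy(bin_samples, tx_on_flags):
    """Integrate |Y_k|^2 separately for TX-on and TX-off samples.

    bin_samples: complex outputs of one transform bin.
    tx_on_flags: Boolean per sample, True while the transmitter is on.
    Returns the difference of the two integrator values.
    """
    acc_on = 0.0
    acc_off = 0.0
    for y, tx_on in zip(bin_samples, tx_on_flags):
        power = y.real * y.real + y.imag * y.imag  # two multipliers, one adder
        if tx_on:
            acc_on += power    # integrator selected while the RF is on
        else:
            acc_off += power   # integrator selected while the RF is off
    return acc_on - acc_off


# Synthetic example: strong samples while on, weak ambient samples while off
samples = [3 + 4j] * 4 + [1 + 0j] * 4
flags = [True] * 4 + [False] * 4
print(lock_in_energy(samples, flags))
```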

4.4  Setup  for  Beam  Measurement 

Figure 142 shows the full experimental setup for obtaining the beam measurements. Figure 142
(a) shows the receive-mode 2.4-GHz array setup inside an anechoic chamber. A 2.4-GHz
directional transmitter antenna is placed at one end of the chamber to generate a plane wave
tone. The transmitter and the receiver array are separated by 2 meters to place the receiver
array in the far field of the transmitter. The transmitter remains fixed, and the receiver
array is rotated around its center using a precision rotation platform controlled by software
to measure the received energy level at different angles. Figure 142 (c) and (d) show a
close-up of the receiver array and the precision rotation platform used to rotate the array,
respectively. The receiver chains, FPGA setup, signal generators, and other equipment are
placed outside the anechoic chamber. The antenna array feeds the receiver chains via coaxial
cables.

Figure  142  (b)  shows  the  signal  processing  end  of  the  beamformer.  Three  oscillators  are  used  in 
the  setup.  One  is  used  to  generate  the  transmitted  2.4-GHz  carrier  tone.  A  “NOISE  XT  SLC”  low 
jitter  clock  synthesizer  was  used  to  generate  the  LO  signal.  The  third  oscillator  was  used  to  clock 
the  ROACH-2  (FPGA)  and  to  perform  the  sampling  of  the  IF  signal.  The  ROACH-2  FPGA 
platform  was  connected  to  a  host  Linux  server  for  software  control  of  the  measurement  setup. 

The rotation platform motor controller was also connected to the same computer so that both
could be controlled from a single program.

Figure 142: Experimental Setup

(a)  Transmitter  and  receiver  in  the  anechoic  chamber,  (b)  receiver  instrumentation  setup 
including  the  RF  receivers,  (c)  front-view  of  the  antenna  array,  and  (d)  rotation  platform. 


Software  End  Integration 

A fully Python-based software control system was developed to perform the beam measurement
task in a fully automated manner on top of the software-to-hardware interface layer provided
by the ROACH-2 platform. A Python subroutine was developed to control the motor for precise
rotation of the array for beam angle measurements. An “8SMC4-USB” motor controller [8] was
used to issue commands from the software routines to the platform via a virtual COM port. The
ROACH-2 platform provides a middle layer for communicating with the FPGA hardware (memory)
through the on-board computer. ROACH-2 is connected to the main host Linux server through a
1-Gbps Ethernet connection. The main Python routine accesses the ROACH-2 platform to perform
control functions and read data from the FPGA memory while iteratively scanning through the
angles. Altogether, this constitutes a fully automated beam measurement setup in which all
beam measurements can be performed in a single run.


Calibration 


Prior to obtaining measurements, the circuits require calibration to function properly.
Calibration was needed at two main points. First, each RF receiver needed to be calibrated
because of mismatches in amplification, mixing, and filtering. In addition, calibration of
the ADC chips integrated into the ROACH-2 platform was required.

ADC Calibration. Calibration of the ADC chips was performed with calibration scripts
provided by CASPER [9]. These scripts perform calibration against a reference input signal of
the same dynamic range as the actual input. A separate microwave circuit was included in the
RF front end to provide this signal, which also allowed the RF front ends to be calibrated.
The calibration setup is described under “RF Receiver Chain Calibration”.

RF Receiver Chain Calibration. Due to mismatches among the 32 RF chains of the current
setup, their outputs for a common reference input signal were not uniform. We therefore
modified the RF front end to include an additional microwave circuit, which allows each RF
chain to be calibrated digitally using a reference input signal. The RF chain calibration
setup is shown in Figure 143. A combiner was placed at the front of each RF chain to provide
a second input for the reference signal.


[Figure 143 labels: reference input; antenna inputs; additional circuit for calibration; reference input to the main splitter]


Figure  143:  RF-chain  Calibration  Setup 


This arrangement eliminates the need to unscrew and reattach the SubMiniature version A
(SMA) cables between the antenna outputs and the RF chains each time a calibration is needed.
The second inputs of the combiners were connected to a splitter that feeds the same reference
signal to all 16 channels. Figure 144 illustrates the digital normalization used. Figure 144
(a) shows block RAM (BRAM) captures, for each channel, of the 10-MHz signal resulting from a
2.4-GHz reference input. Figure 144 (b) shows the digitally normalized signals, in which the
effect of mismatches in the RF chain is neutralized.

Figure 144: (a) BRAM Captured Reference Signals and (b) Digitally Normalized Signals

Lock-in Amplifier Setup for Obtaining Measurements

The transmitter in the test setup was modified to realize lock-in amplifier behavior, making
the measurement less sensitive to reflections and to ambient 2.4-GHz radiation present in the
room. For this purpose, the transmitted signal was converted to a continuously on-off keyed
2.4-GHz signal. Figure 145 (a) shows the block diagram of the hardware configuration used to
generate this signal. Instead of directly using a 2.4-GHz signal input, a 1.2-GHz signal was
used. This signal was on-off keyed using an IF signal generated by another FPGA board (Xilinx
Xtreme DSP kit 4 [10]) via its digital-to-analog converter (DAC). The keyed signal was then
split, mixed, and bandpass filtered to obtain the pulsed 2.4-GHz signal. Figure 145 (b) shows
the realization of the transmitted signal generation circuit using COTS mixers and
amplifiers. The energy detector block shown in the digital circuit architecture in Figure 141
was used to detect the presence of the carrier. Figure 145 (c) shows samples captured in the
FPGA for such a transmission (after downconversion).
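
One way to see why a keyed 1.2-GHz source yields a keyed 2.4-GHz tone is the following short derivation. It assumes that the two split copies are multiplied together in an ideal mixer and that the keying waveform m(t) takes only the values 0 and 1; the symbols m(t), f_0, and s(t) are introduced here for illustration and do not appear in the original text.

```latex
\begin{aligned}
s(t) &= m(t)\cos(2\pi f_0 t), \qquad m(t) \in \{0, 1\}, \quad f_0 = 1.2\ \text{GHz}, \\
s^2(t) &= m^2(t)\cos^2(2\pi f_0 t)
        = \tfrac{1}{2}\, m(t)\,\bigl[\, 1 + \cos\bigl(2\pi (2 f_0)\, t\bigr) \bigr],
        \qquad \text{since } m^2(t) = m(t).
\end{aligned}
```

The bandpass filter rejects the low-frequency term and keeps the component at 2 f_0 = 2.4 GHz, which carries the same on-off keying as the 1.2-GHz input.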


Figure 145: Lock-in Amplifier Design for Generating the Transmitted Signal with On-Off Keying


4.5  Beam  Measurement  Results 
8-point  Beam  Measurements 

The center eight elements of the 16-element array were used to test and measure the beams
obtained using the 8-point approximate transform. The digital design architecture shown in
Figure 141 was used with the 8-point digital cores. The precision rotor stage was used to
record the received energy at a resolution of 1° over the range -65° to +65° about array
broadside. Once the array is moved to a new position, all integrators are reset and
integration is started on all beam output signals simultaneously for a preset amount of time
(clock cycles). The computed values are then stored in the BRAMs of the FPGA. The Python
routine then communicates with the on-board PC on the ROACH-2 platform to read these values
to the host PC and record them.
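
The measurement loop described above can be outlined as follows. This is a structural sketch only: `move_to` and `read_bin_energies` are hypothetical callables standing in for the motor-controller and FPGA-readout routines, whose actual interfaces are not given in the text. The default range and step follow the -65° to +65°, 1° scan described above.

```python
def scan_beams(move_to, read_bin_energies, start_deg=-65, stop_deg=65, step_deg=1):
    """Step the array through a range of angles and record every bin's energy.

    move_to(angle): hypothetical routine that rotates the array to `angle`.
    read_bin_energies(): hypothetical routine that returns one integrated
    energy value per transform bin once the integration has finished.
    """
    results = {}
    angle = start_deg
    while angle <= stop_deg:
        move_to(angle)                              # rotate to the next position
        results[angle] = list(read_bin_energies())  # one value per beam
        angle += step_deg
    return results


# Dry run with stand-in routines in place of the real hardware interfaces
visited = []
measurements = scan_beams(visited.append, lambda: [0.0] * 8)
print(len(measurements), "positions recorded")
```

Passing the hardware routines in as arguments keeps the scan logic testable without the rotation platform or the FPGA attached.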

The process is repeated for each angle according to the rotation resolution used, and the
stored values are plotted in Matlab to generate the beam patterns. The same procedure is
repeated using both the a-DFT and the implemented DFT cores for comparison. Figure 146 shows
the plots generated from the measured values. The digital circuits were clocked at 200 MHz.
The LO signal was maintained at 2.410 GHz, generating a 10-MHz IF signal at the FPGA ADC
inputs. The XIMC multi-platform programming library [8] was used to command the rotation
controller via a virtual COM-port interface.

As a reference, Matlab-simulated beam patterns for each transform (approximate and exact) are
also plotted, taking the element pattern into consideration. That is, the plotted pattern
combines the ideal beam pattern of the transform with the element pattern. To make this more
realistic, a time-domain simulation was conducted that takes the measured element pattern of
each antenna into account by scaling the signal at each antenna element by its gain in the
direction of reception. Note that the plot for Bin 4 is shown only for completeness: the beam
direction for this bin is at end-fire (90°), which falls in the null direction of each antenna
pattern.
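
The relation between a bin index and its beam direction can be sketched for an idealized uniform linear array. The model below assumes half-wavelength element spacing (as stated in Section 4.1), zero-based bin indices, an exact DFT, and a single element-gain function shared by all elements; the sign convention for the angle and all function names are illustrative assumptions, and the measured per-element patterns used in the actual simulation are not reproduced here.

```python
import cmath
import math


def beam_angle_deg(k, n, spacing_wavelengths=0.5):
    """Pointing angle from broadside of DFT bin k for an n-element array."""
    k_wrapped = k if k <= n // 2 else k - n  # map upper bins to negative angles
    return math.degrees(math.asin(k_wrapped / (n * spacing_wavelengths)))


def beam_power(k, n, theta_deg, element_gain=lambda theta_deg: 1.0,
               spacing_wavelengths=0.5):
    """|bin k|^2 for a unit plane wave arriving from theta_deg."""
    u = spacing_wavelengths * math.sin(math.radians(theta_deg))
    g = element_gain(theta_deg)
    y = sum(g * cmath.exp(2j * math.pi * i * u) * cmath.exp(-2j * math.pi * k * i / n)
            for i in range(n))
    return abs(y) ** 2


for k in range(8):
    print(f"bin {k}: {beam_angle_deg(k, 8):+.1f} degrees from broadside")
```

With these assumptions bin 4 of the 8-point transform points at end-fire, consistent with the remark above, and any element pattern with a null at 90° suppresses that beam.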

Figure 147 shows all the beam patterns in single plots. Figure 147 (a) shows the beam
patterns observed with the approximate transform, using the raw values measured at each bin
output. Bins 1, 2, 6, and 7 do not follow the element pattern because of the non-uniform
gains inherent in the approximate transform. Figure 147 (b) shows the normalized patterns for
the same beams in the log domain, where each beam output has been normalized to 1 by dividing
by that beam's maximum value. Figure 147 (c) shows the beam patterns observed with the exact
FFT implementation, and Figure 147 (d) shows the corresponding normalized patterns in the log
domain. The end-fire beam corresponding to bin 4 has been omitted from these plots.
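
The normalization applied before plotting in the log domain can be written compactly. The sketch below assumes the recorded values are energies, so that the conversion uses 10 log10; the floor value for non-positive entries and the function name are illustrative choices.

```python
import math


def normalize_db(beam_energies, floor_db=-60.0):
    """Divide a beam's energies by its maximum and convert to decibels."""
    peak = max(beam_energies)
    pattern_db = []
    for energy in beam_energies:
        ratio = energy / peak
        # Non-positive values (possible after on/off subtraction) are clipped
        pattern_db.append(10.0 * math.log10(ratio) if ratio > 0 else floor_db)
    return pattern_db


print(normalize_db([1.0, 10.0, 5.0, 0.0]))
```

After this step every beam peaks at 0 dB, which removes the non-uniform bin gains of the approximate transform from the comparison.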
Figure 146: Measured and Simulated Beam Patterns for each Bin of 8-point Approximate and Exact Transforms

Figure 147: 8-point Beam Patterns in Single Plots

(a) All beam patterns using the approximate transform from the raw values measured at each bin
output, (b) the normalized beam patterns in the log domain for the approximate transform, (c) all
beam patterns obtained using the exact FFT core, and (d) the normalized patterns of (c) in the
log domain.


16-point  Beam  Measurements 

The same measurement procedure was repeated using all 16 elements of the array to obtain the
beams generated by the 16-point approximate transform. For reference, the beams produced by the
exact-FFT digital core were also measured. The separation between the transmitter and the
receiver must be large enough for the assumption that a plane wave is received by the array to
hold; this is important for obtaining a good measurement of the beam patterns. During broadside
calibration, a significant phase deviation was observed between the signals captured at the two
end-fire elements. This is because the physical aperture of the full 16-element array (96 cm) is
comparable to the distance between the transmitter and receiver. Figures 148 and 149 show the
individual beam plots obtained from the measured values for both the approximate and exact
transforms. The transmitter-receiver separation is constrained by the dimensions of the anechoic
chamber, and this has affected the side-lobe performance of the measured results. Bin 8 in
Figure 149 corresponds to the beam looking at end-fire, which falls into the null direction of
each antenna pattern, and is shown only for completeness.
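
As a rough check on the plane-wave assumption discussed above, the conventional Fraunhofer
far-field boundary 2*D^2/lambda can be evaluated for the 96 cm aperture at 2.4 GHz. This is a
minimal sketch: the choice of the Fraunhofer criterion and the function name
`fraunhofer_distance` are assumptions introduced here for illustration and are not taken from
the measurement procedure itself.

```python
# Sketch: conventional far-field (Fraunhofer) distance for the 16-element array.
# ASSUMPTION: the 2*D^2/lambda criterion is used here only as an illustration.
C = 299_792_458.0  # speed of light in m/s


def fraunhofer_distance(aperture_m: float, freq_hz: float) -> float:
    """Return the far-field boundary 2*D^2/lambda in metres."""
    wavelength = C / freq_hz
    return 2.0 * aperture_m ** 2 / wavelength


# Aperture (96 cm) and carrier frequency (2.4 GHz) as stated in the text.
d_ff = fraunhofer_distance(0.96, 2.4e9)
print(f"far-field distance: {d_ff:.1f} m")
```

Under this criterion the boundary lies at roughly 15 m, which illustrates why a chamber-limited
separation can leave a residual phase curvature across the full aperture.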

Figure 150 shows all the beam patterns in single plots. Figure 150 (a) shows the observed beam
patterns using the approximate transform, plotted from the raw values measured at each bin
output. As in the case of the beams measured for the 8-point approximate transform, it can be
noticed that the bins do not follow the element pattern, due to the non-uniform gains inherent
in the approximate transform. Figure 150 (b) shows the normalized beam patterns for the same
beams in the log domain, where each beam output has been normalized to 1 by dividing by its
maximum value. Figure 150 (c) shows the beam patterns observed from the exact FFT
implementation. Figure 150 (d) depicts the normalized beam patterns in the log domain. It should
be noted that the end-fire beam corresponding to the output of bin 8 has been ignored in these
plots.
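
The per-beam normalization described above (divide each beam by its own maximum, then move to
the log domain) can be sketched as follows. The helper name `normalize_db`, the sample values,
the -60 dB floor, and the use of 20*log10 (i.e. treating the bin outputs as magnitudes rather
than powers) are all assumptions made for illustration.

```python
import numpy as np


def normalize_db(beam_mag, floor_db=-60.0):
    """Scale one beam's magnitude response to a peak of 1 and convert it to dB."""
    beam_mag = np.asarray(beam_mag, dtype=float)
    normalized = beam_mag / beam_mag.max()
    # Clip very small values so that exact nulls do not produce log10(0).
    floor_lin = 10.0 ** (floor_db / 20.0)
    return 20.0 * np.log10(np.maximum(normalized, floor_lin))


# Hypothetical raw magnitudes of one bin output over five look angles.
raw = np.array([0.2, 1.0, 4.0, 1.0, 0.0])
pattern_db = normalize_db(raw)
print(pattern_db)
```

Normalizing each bin separately removes the non-uniform gain between bins, so the beam shapes
can be compared directly even though the raw peak levels differ.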

Figure 148: Measured Beam Patterns for Bins 0-7 of 16-point Approximate and Exact Transforms

Figure 149: Measured Beam Patterns for Bins 8-15 of 16-point Approximate and Exact Transforms

Figure 150: 16-point Beam Patterns in Single Plots

(a) All beam patterns drawn in one plot using the 16-point approximate transform from the raw
measured values at each bin's output, (b) the normalized beam patterns (log domain) for the
approximate transform, (c) all beam patterns obtained using the exact FFT core, and (d) the
normalized patterns of (c) in the log domain.

5. FUTURE RESEARCH


Fast cross-correlation at massive throughput is critically important for military systems.
Towards realizing fast, precise cross-correlations at massive throughputs (more than one billion
parallel cross-correlations per second), we reduce the multiplier complexity from O(N log N) to
zero for small N ≤ 32, and from O(N log N) to O(N) for large N > 32, while maintaining the adder
complexity at O(N log N).

Approximate DFT algorithms would positively impact systems that use FFTs as building blocks,
e.g., radars, cross-correlators, uniform DFT filterbanks, fractional delays, orthogonal-frequency
division multiplex (OFDM) systems, and multi-beam arrays. Our a-DFTs are not limited to sparse
input signals and are multiplier-free, which leads to circuit realizations with small size,
weight, and power (SWaP). Our algorithm is closed-form but trades off DFT-filterbank shapes by a
small amount (bounded, and with complete theoretical analysis available to understand the
trade-offs involved as a function of frequency [bin number], error magnitude, and distribution)
in order to break the lower bounds of the FFT complexity without the need for sparse inputs. The
proposed a-DFTs are suitable for the fastest digital signal processing (at microwave and mm-wave
radio frequencies) at low circuit complexity, maximum speed, and low power consumption while
assuming non-sparse signals. Future studies for the DARPA Microsystems Technology Office (MTO)
will seek answers to the following scientific questions:

Q1. How can the DFT operation be replaced by a close approximation that does not need any
multipliers (or significantly reduces their number)? What are the suitable fast algorithms for
realizing these multiplierless/low-complexity DFT approximations at the lowest adder complexity
for transform sizes in the range of 128 to 4096 FFT points? This research question directly
builds on the success of the approximate DFT in the 8-, 16-, 32-, and 64-point cases, as
explored in the first DARPA MTO seedling project.

Q2. How well do the a-DFTs compare with their exact DFT counterparts in terms of the frequency
responses of the filterbanks for the larger transforms (128 ≤ N ≤ 2048)? What are the error
magnitude and distribution as a function of frequency? What is the trade-off between
approximating the DFT and reducing the on-chip area as well as the power consumption? How can
this trade-off be quantified for a scientific comparison and to inform design choices by DoD? In
particular, we are interested in exploring cross-correlation as one of the major practical
applications of the proposed approximate-DFT algorithms. The accuracy of the cross-correlation,
both in the time and frequency domains, will be studied and compared with a baseline for the
exact-FFT complex correlation.

Q3. What is the performance trade-off considering finite-precision arithmetic in both the
traditional FFT and the proposed a-DFT algorithms? How do baseline fixed-point FFT cores compare
with the a-DFT in a design space covering cross- and auto-correlation, for example, correlation
function shape, reference thresholds for correlation-based decision making, and calculation of
the power spectrum?

Q4. How do baseline fixed-point FFT cores compare with the a-DFT in a design space covering beam
fidelity, pointing accuracy, requantization (digital) noise, area, time, area-time,
area-time-squared, dynamic power consumption, and clock speed? How can the low-complexity
multiplierless approximate FFT be used to reduce the computational complexity of multirate
digital FIR filterbanks? In multirate signal processing, the FFT and IFFT are used extensively
as a vehicle for reducing the computational complexity of polyphase filterbanks. We will explore
the possibility of replacing the FFT with the approximate FFT, thereby reducing the filterbank
complexity even further at the cost of a compromise in filterbank accuracy.

Q5. What are the measured experimental responses and VLSI system implementation metrics for the
proposed aperture multi-beamforming, multirate filterbank, and fast cross-correlation methods?

Q6.  When  the  input  is  totally  real  valued,  can  we  replace  a-DFT  with  sparse  factor 
implementations  of  approximate  discrete  Hartley  transforms  (a-DHTs)  and  therefore  obtain 
additional  savings  in  multiplier  complexity  for  fast  complex  correlators?  Preliminary  results 
indicate  this  is  indeed  possible,  and  that  an  additional  saving  of  complexity  by  up  to  50%  over 
what  we  expect  by  adopting  a-DFTs  in  place  of  FFTs  may  in  fact  be  feasible  for  cross 
correlation. 
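
To make the role of the Hartley transform for real-valued inputs concrete, the sketch below
computes the exact discrete Hartley transform (DHT) of a real sequence from its DFT, using the
standard identity H[k] = Re(X[k]) - Im(X[k]). It illustrates only the exact DHT; it is not the
a-DHT mentioned above, whose construction is not given here. The function names `dht` and
`dht_direct` are hypothetical.

```python
import numpy as np


def dht(x):
    """Exact DHT of a real sequence, obtained from the DFT as Re(X) - Im(X)."""
    spectrum = np.fft.fft(np.asarray(x, dtype=float))
    return spectrum.real - spectrum.imag


def dht_direct(x):
    """Reference DHT using the cas kernel: cas(t) = cos(t) + sin(t)."""
    x = np.asarray(x, dtype=float)
    k = np.arange(len(x))
    arg = 2.0 * np.pi * np.outer(k, k) / len(x)
    return (np.cos(arg) + np.sin(arg)) @ x


x = np.random.default_rng(1).standard_normal(16)
h = dht(x)
# The DHT is real-valued and is its own inverse up to a factor of 1/N.
x_back = dht(h) / len(x)
```

Because the DHT maps real inputs to real outputs, a real-input correlator built on it avoids
carrying complex-valued intermediate data, which is the kind of saving the question above asks
about.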

Future  work  will  involve  the  following: 

• Recursive twiddle-factor quantization: consider fast algorithms for the DFT (N > 32) and
optimally map the multiplicands into low-complexity dyadic rationals. An initial mapping would
be to directly find the closest rational approximation to each multiplicand. Such an approach
does not take into account the interplay among the multiplicands, and these relations have a
significant role in finding good approximations. Therefore, mappings derived from multivariate
analysis and multi-criteria optimization schemes might be sought.

• Hybrid transformation algorithms: combination of an approximate DFT and an exact FFT
factorization to yield low-complexity transforms of exceptionally large size, which are
intractable to derive using direct numerical search methods. In the first seedling, we searched
for and found three optimized versions of multiplierless approximations for the 32-point FFT.
These a-FFT algorithms do not require any multiplications and maintain a number of additions
similar to that of available exact FFTs (e.g., the Duhamel algorithm). We will explore the use
of the Cooley-Tukey algorithm, starting from the multiplierless 32-point a-FFT, so that larger
a-FFTs can be found. In preliminary work, we have explored this approach to create a 1024-point
a-FFT (see Figure 151) that reduces the number of multipliers from O(N log2 N) down to O(N).
Although this preliminary algorithm is not entirely multiplierless, it offers tremendous savings
in hardware and power consumption by removing about 90% of the parallel multiplier circuits that
would be needed if a 1024-point exact FFT were designed in digital VLSI. For example, for
N = 1024, the exact FFT requires about 10240 parallel multipliers, while our preliminary
algorithms require at most 1024 multipliers. This is an order-of-magnitude reduction in the
number of multipliers on chip.

• Linear/quadratic optimization for approximate-DFT derivation: For larger transforms, such as
N = 1024, the computational search space is $3^{1024 \times 1024} \approx 7.9 \cdot 10^{500297}$,
an intractably huge number. Therefore, when moving to large values of N, we will adopt a new
scheme in this seedling project where we use methods based on linear/quadratic optimization to
derive DFT approximations for large block lengths (N > 32).
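
The "initial mapping" mentioned in the first bullet above, rounding each multiplicand
independently to the closest dyadic rational, can be sketched as follows. This is only the
naive baseline that the bullet says ignores the interplay among multiplicands; it is not the
optimized mapping proposed for future work. The function names, the choice N = 32, and the bit
widths are assumptions for illustration. A multiplication by a multiple of 2**-bits can be
realized with shifts and additions, which is why dyadic rationals are attractive.

```python
import numpy as np


def dft_matrix(n):
    """Exact N-point DFT matrix with entries exp(-2j*pi*k*m/N)."""
    k = np.arange(n)
    return np.exp(-2j * np.pi * np.outer(k, k) / n)


def quantize_dyadic(x, bits):
    """Round real and imaginary parts to the nearest multiple of 2**-bits."""
    scale = 2.0 ** bits
    return (np.round(x.real * scale) + 1j * np.round(x.imag * scale)) / scale


def relative_error(n, bits):
    """Relative Frobenius-norm error of the element-wise quantized DFT matrix."""
    exact = dft_matrix(n)
    approx = quantize_dyadic(exact, bits)
    return np.linalg.norm(exact - approx) / np.linalg.norm(exact)


errors = {bits: relative_error(32, bits) for bits in (1, 2, 4)}
for bits, err in errors.items():
    print(f"bits={bits}: relative Frobenius error = {err:.4f}")
```

The matrix-level error is only one possible figure of merit; the frequency responses of the
individual bins would be the more relevant comparison for filterbank and beamforming use.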


Figure 151: 1024-point Magnitude Beam Responses

(a) 1024-point a-DFT magnitude beam responses, (b) 1024-point magnitude exact-FFT beam
responses, (c) differences in a-DFT and exact-DFT beams (N = 1024), (d) two pulses with
time-delay, (e) a-DFT based complex correlator, and (f) cross-correlator outputs for
approximate-DFT and exact-DFT.


Example: High-Throughput Approximate-FFT Cross-Correlation. Low-complexity cross-correlators
suitable for massive computation can be derived from approximate-FFT algorithms by means of the
convolution theorem. FFTs are used to reduce the complexity of cross-correlation from O(N^2) in
the time domain to O(N log N) in the Fourier domain. We reduce the correlation complexity even
further by reducing the multiplier complexity of the a-FFT to O(N). For a test case where
N = 1024, this implies a 70-90% smaller VLSI circuit and power consumption compared with
conventional FFT-based cross-correlator designs. For a predicted clock frequency of 1 GHz, this
implies that, for pipelined systolic digital CMOS realizations, the real-time throughput is
1 billion 1024-point cross-correlations per second.
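
The Fourier-domain correlation step referred to above can be sketched as follows. The example
uses NumPy's exact FFT; the `fft` and `ifft` arguments are hypothetical hooks marking where an
approximate transform would be substituted, and the pulse shape and delay are made-up test
values rather than data from this work.

```python
import numpy as np


def xcorr_fft(x, y, fft=np.fft.fft, ifft=np.fft.ifft):
    """Circular cross-correlation r[m] = sum_n conj(x[n]) * y[n + m] via the FFT."""
    return ifft(np.conj(fft(x)) * fft(y))


n = 1024
true_delay = 37  # hypothetical delay in samples

# A short random pulse and a circularly delayed copy of it.
rng = np.random.default_rng(0)
pulse = np.zeros(n)
pulse[100:132] = rng.standard_normal(32)
delayed = np.roll(pulse, true_delay)

r = xcorr_fft(pulse, delayed)
estimated_delay = int(np.argmax(np.abs(r)))
print(estimated_delay)
```

The correlator needs two forward transforms, N complex products, and one inverse transform, so
the transforms dominate the multiplier count; this is why lowering the multiplier complexity of
the transform itself lowers the cost of the whole correlator.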

The replacement of the FFT with the proposed N-point approximate-FFT algorithms for large N
reduces the multiplier complexity values for both cases to O(N) without an increase in adder
complexity. In VLSI (45-nm CMOS), a parallel (complex) multiplier can be about α = 10 times
larger in chip area than a (complex) adder, which implies that for N > 1024, the fractional
savings in VLSI area can be an order of magnitude.
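
A back-of-the-envelope version of this area argument is sketched below for N = 1024 and a
multiplier-to-adder area ratio of 10 (named `alpha` in the code). The cost model is an
assumption made for illustration: the exact FFT is charged N*log2(N) multipliers and N*log2(N)
adders, the approximate FFT is charged N multipliers with the same adder count, and all areas
are expressed in units of one adder.

```python
import math


def fft_area(n: int, alpha: float) -> float:
    """Exact FFT: N*log2(N) multipliers plus N*log2(N) adders, in adder units."""
    ops = n * math.log2(n)
    return alpha * ops + ops


def afft_area(n: int, alpha: float) -> float:
    """Approximate FFT: N multipliers, adder count unchanged (assumed model)."""
    return alpha * n + n * math.log2(n)


n, alpha = 1024, 10.0
saving = 1.0 - afft_area(n, alpha) / fft_area(n, alpha)
print(f"fractional area saving: {saving:.1%}")
```

Under this simplified model the saving for N = 1024 is about 82%, which falls inside the 70-90%
range quoted for the test case in this example; memory, routing, and control overheads are not
modelled.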

6. CONCLUSIONS


The use of approximate computing for computing the DFT was investigated with digital array
processing for antenna beamformers in mind. Starting from the 8-point DFT, approximations for
the 16-, 32-, and 64-point DFT were proposed and evaluated in simulation. These proposed
approximations reduce the required number of multipliers to zero. A new low-complexity
approximate DFT for 1024 points was also proposed. This approximation is also factorized, and
employs 1024 multipliers.

For the approximate-DFT algorithms proposed, a sparse factorization has been found that reduces
the adder complexity. Theoretical performance has been quantified with respect to the exact FFT
implementation, and frequency bin-wise analysis has been given. The multiplierless transforms
reduce the well-known O(N log N) multiplier complexity of FFT algorithms to zero for an N-point
transform. The approximate transforms have been implemented on FPGA using the sparse
factorizations found for each case, which leads to implementations with reduced adder complexity
and zero multipliers. Exact counterparts of each FFT size were also implemented for comparison.
The hardware resource utilization figures have been reported for both the approximate and exact
cases. The use of the approximate transforms for multi-beamforming in linear and aperture arrays
was studied. Theoretical and numerical analysis was conducted for both 1-D and 2-D cases to
analyze the performance of the beams and side-lobes. Several examples have been shown confirming
the applicability of the proposed transforms to multi-beamforming applications.

A 2.4-GHz receive-mode multi-beamforming system was implemented in the lab to obtain the
measured beam patterns arising from the proposed approximate transforms for verification of the
algorithms. A 16-element linear patch antenna array was designed and built using Nyquist element
spacing. Sixteen IQ direct-conversion receiver chains were implemented using commercial
off-the-shelf components. The downconverted signals in the chains were then sampled and
processed using a ROACH-2 FPGA processing platform. The 8-point and 16-point approximate
transforms were used in FPGA designs to calculate and measure the beam patterns from the
constructed setup. The detailed description of the beamforming setup has been given and the
measured beam patterns have been reported. The beam patterns arising from the exact DFT designs
have also been measured and presented for comparison. It can be seen that the measured patterns
for the approximate transforms closely follow the beams obtained for the exact FFT versions.

LIST OF ACRONYMS, ABBREVIATIONS, AND SYMBOLS


ACRONYM   DESCRIPTION

ADC       analog-to-digital conversion
a-DFT     approximate-discrete Fourier transform
AESA      active electronically scanned array
AFRL      Air Force Research Laboratory
BRAM      block RAM
CASPER    Collaboration for Astronomy Signal Processing and Electronics Research
CLB       configurable logic block
CMOS      complementary metal-oxide-semiconductor
COTS      commercial off-the-shelf
DAC       digital-to-analog converter
DAR       digital array radar
DARPA     Defense Advanced Research Projects Agency
demux     demultiplexer
DFT       discrete Fourier transform
DSP       digital signal processing
FFT       fast Fourier transform
FPGA      field-programmable gate array
I         in-phase
IF        intermediate frequency
IQ        in-phase quadrature
LNA       low noise amplifier
LO        local oscillator
MTO       Microsystems Technology Office
OFDM      orthogonal-frequency division multiplex
PI        Principal Investigator
Q         quadrature
RA        Research Assistant
RF        radio frequency
ROACH     Reconfigurable Open Architecture Computing Hardware
SCR       software controllable register
SMA       SubMiniature version A
SWaP      size, weight, and power
ULA       uniform linear array
VLSI      very-large-scale integration